import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.graph_objects as go
%matplotlib inline
sns.set(rc={'figure.figsize':(12, 6)})
We'll start our process by reading in the scarped results and conducting some data transformations and other changes to clean the date.
records = pd.read_csv('scraped_results_df.csv')
records.info()
records.describe()
records.tail()
We've got some data cleaning to do and theres some additional data fields that I'd like to calculate. The below code cell cleans up some column names, separates the rank from the team name, determines a "rank_diff" score when ranked teams play each other, a "pts_diff" score to track the points difference in the game, and sets a multi-index on year, week of the season, and individual game.
records = pd.read_csv('scraped_results_df.csv')
# 'game_loc' indicated where the game was played at. Since the winner is listed first in
# the original data, the '@' indicates the game was played at the loser's home. Therefore,
# we can create a new column called 'Winner_home' if the '@' sign is not present.
records['winner_home'] = records['game_loc']!='@'
# The rank is included in the winner and losers name within parenthesis. The below regex will identify
# numerical digits within the parenthesis and extract them to a new column as 'floats'. We'll also
# remove the rank in parenthesis from the original winner column. We'll do this for winners and losers.
records['winner_rank'] = records['winner'].str.extract('\(([0-9]+)\)', expand=True).astype('float')
records['winner_name'] = records['winner'].str.replace('\(([0-9]+)\)', '').str.replace('\xa0', '')
records['loser_rank'] = records['loser'].str.extract('\(([0-9]+)\)', expand=True).astype('float')
records['loser_name'] = records['loser'].str.replace('\(([0-9]+)\)', '').str.replace('\xa0', '')
# Calculate a rank_diff socre. The more negative this is, the more of an upset it is.
records['rank_diff'] = records['loser_rank'] - records['winner_rank']
# Add a pts_diff between the two pts as we can use margin of victory to see how close a
# game is.
records['pts_diff'] = records['winner_pts'] - records['loser_pts']
# We no longer need several of these columns, so lets drop them.
records.drop(['Unnamed: 0', 'winner','loser'], axis=1, inplace=True)
records.set_index(['year','week_number', 'row'], inplace=True)
From the below plot, we can see a tremendous explosion in number of games in the 1970's. In fact in 1977, the NCAA split into 1A for larger schools with better football teams and 1AA for smaller schools. In recent years, part of the rise in number of games has been a proliferation of games played after the normal season, including bowl games and playoffs.
ax = records.groupby('year').size().plot(figsize=(12,6), title='Games Played Per Year')
ax.set_xlabel('Year')
This plot shows how many games are played in each week. The small number of games played after week 12 relative to the rest of the year highlights both a longer season in recent years and the growth in "post-season" bowl games, conference champions, and most recently play-off games
# total number of games per week
records.groupby('week_number').size().plot()
By looking at the distribution of winner points, loser points, difference between points, and the rank difference over the entire data, several trends emerge. First, winner points (the total number of points a winner scores) is mostly normally distributed. On the other hand, the loser's points and the points difference are dramatically skewed to the right with most losers scoring less than 15 points and points differences less then 20.
records[['winner_pts', 'loser_pts', 'pts_diff', 'rank_diff']].hist(figsize=(12,6))
By looking at points scored by the winners, losers, and the difference, we can see that over time more points have been scored by both teams. However, the average points difference has remained relatively stead over the last 70 years.
records.groupby('year').mean().plot(y=['winner_pts','loser_pts','pts_diff'], title="Points Scored, 1950 to 2018")
If we want to look at how "accurate" the polls have been when ranked teams play each other, we could look at the rank difference between the two teams. We would expect a positive score as that means the higher ranked team defeated the lower ranked team. The plot below shows the variability of the average rank difference score over the last 70 years.
records.groupby('year').mean().plot(y='rank_diff', title="Average Rank Difference By Year, 1950 to 2018")
If we look at the rank difference averaged across each week, we see a far higher accuracy in the first 15 weeks wieht a score averaging between +2 and +3, before dropping to almost -2. This is likley because it is far more difficult to pick the winner in closely matched games in the post-season when conference championships and bowl games occur.
records.groupby('week_number').mean().plot(y='rank_diff', title="Average Rank Difference By Week, 1950 to 2018")
By plotting points difference against the rank difference, we see that the more positive the rank difference, the greater the margin of victory. This makes sense when a higher ranked team plays and defeats a lower ranked team, the bigger difference in the rank leads to a more severe points difference. However, this scatter plot does not work well with dense data like this. Let's try using a hexplot from Seaborn.
records.plot(x='rank_diff', y='pts_diff', kind='scatter', c='black')
sns.jointplot(x='rank_diff', y='pts_diff', data=records, kind='hex', height=8)
Using the rank difference field, we can look for the biggest upsets between ranked teams. A more negative rank difference indicates a more dramatic upset. Next we can look at when unranked teams beat the number one ranked team.
# Biggest upsets between ranked teams
records[records['rank_diff'] < -18].sort_values('rank_diff')[:5]
We can look at when a winner is not ranked and the loser is ranked 1, which has happened 35 times in the data.
records[(records['winner_rank'].isna()) & (records['loser_rank'] == 1)]
We can also look at ties, when the points difference is 0. The dataset does not break out ties, but still lists one team as a winner. There are only 786 ties in the dataset up to 1995, when new rules instituted tie breakers. However, we need to remember that ties are not properly recorded in this dataset.
# ties
print(len(records[records['pts_diff']==0]))
(records[records['pts_diff']==0]).tail(5)
First, lets find out how many games there should be with at least one ranked team. There are 48,899 games, so the total numers of games with at least one ranked and no one ranked should be 48,899.
print('Number of total games:', len(records))
at_least_one_ranked = records[(records['winner_rank'] > 0) | (records['loser_rank'] > 0)]
print('Numbers of games where at least one team is ranked:', len(at_least_one_ranked))
no_ranked = records[(records['winner_rank'].isna()) & (records['loser_rank'].isna())]
print('Numbers of games where no team is ranked:', len(no_ranked))
both_ranked = records[(records['winner_rank'] > 0) & (records['loser_rank'] > 0)]
print('Numbers of games where both teams are ranked:', len(both_ranked))
Looks like there are two cases in our dataset where teams have the exact same rank. In these cases the teams were tied for these ranks AND played each other.
records[records['rank_diff']==0]
Looks good! Now lets make a new column for each of the different options for games. We'll treat these fields as booleans, which will allow us to easily compute the mean and size of each of these categories over years.
both_unranked = records[(records['winner_rank'].isna()) & (records['loser_rank'].isna())]
print('Number of times unranked teams play:', len(both_unranked))
records['both_unranked'] = (records['winner_rank'].isna()) & (records['loser_rank'].isna())
ranked_beats_unranked = records[(records['winner_rank'] > 0) & (records['loser_rank'].isna())]
print('Number of times a ranked team beats an unranked team:', len(ranked_beats_unranked))
records['ranked_beats_unranked'] = (records['winner_rank'] > 0) & (records['loser_rank'].isna())
unranked_beats_ranked = records[(records['winner_rank'].isna()) & (records['loser_rank'] > 0)]
print('Number of times an unranked team upsets a ranked team:', len(unranked_beats_ranked))
records['unranked_beats_ranked'] = (records['winner_rank'].isna()) & (records['loser_rank'] > 0)
lower_beats_higher = both_ranked[both_ranked['rank_diff'] < 0]
print('Number of times an a lower ranked team upsets a higher ranked team:', len(lower_beats_higher))
records['lower_beats_higher'] = (records['winner_rank'] > 0) & (records['loser_rank'] > 0) & (records['rank_diff'] < 0)
higher_beats_lower = both_ranked[both_ranked['rank_diff'] > 0]
print('Number of times an a higher ranked team beats a lower ranked team:', len(higher_beats_lower))
records['higher_beats_lower'] = (records['winner_rank'] > 0) & (records['loser_rank'] > 0) & (records['rank_diff'] > 0)
records.head()
Are there any trends to how often upsets happen? If pollsters were doing a good job, we would rarely see unranked teams beat ranked teams, only occasionally see a lower team beat a higher ranked team, usually see higher ranked teams beat lower, and almost always see ranked teams beat unranked teams. Since we dont have any insight into how much better the 26th best team is compared to the 125th team, we will not be concerned about the fifth type of interaction, unranked teams beating unranked teams.
ratios = records[['both_unranked', 'ranked_beats_unranked', 'higher_beats_lower',
'lower_beats_higher', 'unranked_beats_ranked', ]]
ratios_sum = ratios.groupby('year').sum()
ratios_mean = ratios.groupby('year').mean()
These colors are grouped by level of expected outcome. Blue are neutral games, darker colors are more extreme, greens are expected, and reds are unexpected. Therefore, an unranked team beating a ranked team is dark red as it is very unexpected.
colors = ['steelblue', 'darkgreen', 'forestgreen', 'salmon', 'firebrick']
We can look at the data as both an area chart of counts of the different types, and an area chart of percentage of types of interactions.
ax = ratios_sum.plot.area(figsize=(12,6), color=colors, title='Stacked Area Chart of Types of Games, Total Games, 1950 to 2018')
ax = ratios_mean.plot.area(figsize=(12,6), color=colors, title='Stacked Area Chart of Types of Games, Averages, 1950 to 2018')
However, these plots are still lacking in that there is a lot of information that is not easily accessible. An interactive plot will make it easier to see this data.
x=ratios_sum.index.to_list()
fig = go.Figure()
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['both_unranked'],
name='Both Unranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[0]),
stackgroup='one' # define stack group
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['ranked_beats_unranked'],
name='Ranked Beats Unranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[1]),
stackgroup='one' # define stack group
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['higher_beats_lower'],
name='Higher Beats Lower Ranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[2]),
stackgroup='one' # define stack group
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['lower_beats_higher'],
name='Lower Beats Higher Ranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[3]),
stackgroup='one' # define stack group
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['unranked_beats_ranked'],
name='Unranked Beats Ranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[4]),
stackgroup='one' # define stack group
))
fig.update_layout(title="Stacked Area Chart of Types of Games, Total Games, 1950 to 2018")
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['both_unranked'],
name='Both Unranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[0]),
stackgroup='one', # define stack group
groupnorm='percent'
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['ranked_beats_unranked'],
name='Ranked Beats Unranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[1]),
stackgroup='one', # define stack group
groupnorm='percent'
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['higher_beats_lower'],
name='Higher Beats Lower Ranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[2]),
stackgroup='one', # define stack group
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['lower_beats_higher'],
name='Lower Beats Higher Ranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[3]),
stackgroup='one', # define stack group
groupnorm='percent'
))
fig.add_trace(go.Scatter(
x=x, y=ratios_sum['unranked_beats_ranked'],
name='Unranked Beats Ranked',
hoverinfo='x+y',
mode='lines',
line=dict(width=0.5, color=colors[4]),
stackgroup='one', # define stack group
groupnorm='percent'
))
fig.update_layout(
showlegend=True,
yaxis=dict(
type='linear',
range=[1, 100],
ticksuffix='%'))
fig.update_layout(title="Stacked Area Chart of Types of Games, Percentage, 1950 to 2018")
fig.show()
We can look for the most 'accurate' years by computing the correct answers (ranked beats unranked and higher beats lower divided by all games with ranked teams) and the incorrect answers (unranked beats ranked and lower beats higher divided by all games with ranked teams). Then we can find the difference between the correct answers and incorrect answers to get a rough metric for accuracy.
ratios_years = ratios.groupby('year').mean()
ratios_years
total = ratios_years['ranked_beats_unranked'] + ratios_years['higher_beats_lower'] + ratios_years['lower_beats_higher'] + ratios_years['unranked_beats_ranked']
ratios_years['correct'] = (ratios_years['ranked_beats_unranked'] + ratios_years['higher_beats_lower'])/total
ratios_years['incorrect'] = (ratios_years['lower_beats_higher'] + ratios_years['unranked_beats_ranked'])/total
ratios_years['accuracy'] = ratios_years['correct'] - ratios_years['incorrect']
acc_df = ratios_years[['correct','incorrect','accuracy']]
acc_df.plot(figsize=(12,6), color=('g','r','black'), style=['--','--','-'], linewidth=3,
title="Correct, Incorrect, and Overall Accuracy of AP Poll in Games between Ranked Teams, 1950 to 2018")
acc_df.sort_values('accuracy')
Looks like the worst years for accuracy were in the 60s. College football at the time was still dealing with intense issues of segregation with many teams in the south still not allowing African-Americans to play. Additionally, some of these teams did not want to play racially integrated teams. This could be a reason for the inaccuracies of the polls.
labels = ['Both Unranked', 'Ranked Beats Unranked', 'Higher Beats Lower Ranked',
'Lower Bears Higher Ranked', 'Unranked Beats Ranked']
fig = go.Figure(data=[go.Pie(labels=labels, values=ratios.sum().to_list(), hole=.2,
direction = 'clockwise', sort=False, )])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=16,
marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title="Pie Chart of Types of Team Interactions, 1950 to 2018")
fig.show()
def plot_pie(year):
labels = ['Both Unranked', 'Ranked Beats Unranked', 'Higher Beats Lower Ranked',
'Lower Bears Higher Ranked', 'Unranked Beats Ranked']
fig = go.Figure(data=[go.Pie(labels=labels, values=ratios.groupby('year').mean().loc[year].to_list(), hole=.2,
direction = 'clockwise', sort=False, )])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=16,
marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title="Pie Chart of Types of Team Interaction, {}".format(year))
fig.show()
plot_pie(1965)
plot_pie(1973)
plot_pie(2018)
records_recent = records.loc[2000:2019]
records_recent.groupby('loser_name').mean()[(records_recent.groupby('loser_name').count()['rank_diff'] > 5)].sort_values('rank_diff', ascending=True).head()
records_recent.groupby('winner_name').mean()[(records_recent.groupby('winner_name').count()['rank_diff'] > 5)].sort_values('rank_diff', ascending=False).head()
records_recent[(records_recent['loser_name'] == 'Iowa State') & (records_recent['rank_diff'] > 0)]
records_recent[(records_recent['loser_name'] == 'Alabama') & (records_recent['rank_diff'] > 0)]
records_recent[(records_recent['loser_name'] == 'Texas') & (records_recent['rank_diff'] > 0)]